RDD Optimization Techniques

Caching: Caching is a mechanism to speed up applications that access the same RDD multiple times. An RDD that is neither cached nor checkpointed is re-evaluated each time an action is invoked on it. There are two function calls for caching an RDD: cache() and persist(level: StorageLevel). The difference between them is that cache() stores the RDD in memory, whereas persist(level) can store it in memory, on disk, or in off-heap memory, according to the caching strategy specified by level. persist() without an argument is equivalent to cache().
  • RDD.cache is also a lazy operation.
  • If you run textFile.count the first time, the file will be loaded, cached, and counted.
  • If you call textFile.count a second time, the operation will use the cache: it will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
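A minimal sketch of the behavior above. The local SparkContext, the file name data.txt, and its two-line contents are assumptions for illustration; in spark-shell, `sc` already exists:

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Hypothetical local context for the sketch.
val sc = new SparkContext("local[*]", "caching-demo")

// Write a small sample file so the example is self-contained.
Files.write(Paths.get("data.txt"), "line one\nline two\n".getBytes)

val textFile = sc.textFile("data.txt")  // lazy: nothing is read yet
textFile.cache()                        // also lazy: only marks the RDD for in-memory storage

val first = textFile.count()   // first action: loads the file, caches it, counts the lines
val second = textFile.count()  // second action: takes the data from the cache and counts

// persist(level) generalizes cache(): MEMORY_AND_DISK spills partitions to
// disk instead of dropping them when they do not fit in memory.
val persisted = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)
```

cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), so the last line is the only way to get the spill-to-disk behavior.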
Checkpointing
  • Checkpointing stores the RDD physically to HDFS and destroys the lineage that created it.
  • The checkpoint files are not deleted even after the Spark application terminates.
  • Checkpoint files can be used by a subsequent job run or driver program.
  • Checkpointing an RDD causes double computation: the RDD is computed once for the action that triggers it, and again by the separate job that writes it to the checkpoint directory. This is why it is recommended to cache() an RDD before checkpointing it.
checkpoint(), on the other hand, breaks the lineage and forces the RDD to be stored on disk. Unlike cache()/persist(), frequent checkpointing can slow down your program. Checkpoints are recommended when a) working in an unstable environment, to allow fast recovery from failures, or b) storing intermediate states of a calculation when new entries of the RDD depend on previous entries, i.e. to avoid recalculating a long dependency chain in case of failure.
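A short sketch of the checkpointing workflow described above. The local SparkContext, the checkpoint directory, and the 50-step lineage are assumptions for illustration; on a cluster the checkpoint directory would be an HDFS path:

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "checkpoint-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")  // on a cluster, use an HDFS path

// Build a long lineage with iterative transformations.
var rdd = sc.parallelize(1 to 1000)
for (_ <- 1 to 50) rdd = rdd.map(_ + 1)

rdd.cache()       // avoid the double computation: the checkpoint job reuses the cached data
rdd.checkpoint()  // lazy: marks the RDD; the write happens at the next action
rdd.count()       // triggers both the computation and the checkpoint write

// After the action, the lineage is truncated: the RDD now reads from the
// checkpoint files instead of replaying the 50 map steps on recovery.
```

Note that checkpoint() must be called before any action has run on the RDD; calling it afterwards has no effect because the RDD has already been materialized.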
